Abstract

Computational Grids have become an important and
popular computing platform for both scientific and commercial
distributed computing communities. However, users of
such systems typically find that achieving good application
execution performance remains challenging. Although Grid
infrastructures  such  as  Legion  and  Globus  provide  basic  re¬ 
source  selection  functionality,  work  allocation  functional¬ 
ity,  and  scheduling  mechanisms,  applications  must  interpret 
system  performance  information  in  terms  of  their  own  re¬ 
quirements  in  order  to  develop  performance-efficient  sched¬ 
ules. 

We  describe  a  new  high-performance  scheduler  that  in¬ 
corporates  dynamic  system  information,  application  re¬ 
quirements,  and  a  detailed  performance  model  in  order  to 
create  performance  efficient  schedules.  While  the  sched¬ 
uler  is  designed  to  provide  improved  performance  for  a 
magneto  hydrodynamics  simulation  in  the  Legion  Compu¬ 
tational  Grid  infrastructure,  the  design  is  generalizable  to 
other  systems  and  other  data-parallel,  iterative  codes.  We 
describe  the  adaptive  performance  model,  resource  selec¬ 
tion  strategies,  and  scheduling  policies  employed  by  the 
scheduler.  We  demonstrate  the  improvement  in  application 
performance  achieved  by  the  scheduler  in  dedicated  and 
shared  Legion  environments. 


This  research  was  supported  in  part  by  DARPA  Contract#N66001- 
97-C-S531,  DoD  Modernization  Contract  9720733-00,  and  NSF/NPACI 
Grant  ASC-9619020 


1.  Introduction 

Computational  Grids  [7]  are  rapidly  becoming  an  impor¬ 
tant  and  popular  computing  platform  for  both  scientific  and 
commercial  distributed  computing  communities.  Grids  in¬ 
tegrate  independently  administered  machines,  storage  sys¬ 
tems,  databases,  networks,  and  scientific  instruments  with 
the  goal  of  providing  greater  delivered  application  perfor¬ 
mance  than  can  be  obtained  from  any  single  site.  There 
are  many  critical  research  challenges  in  the  development  of 
Computational  Grids  as  an  effective  computing  platform. 
For  users,  both  performance  and  programmability  of  the  un¬ 
derlying  infrastructure  are  essential  to  the  successful  imple¬ 
mentation  of  applications  in  Grid  environments. 

The  Legion  Computational  Grid  infrastructure  [11]  pro¬ 
vides  a  sophisticated  object-oriented  programming  envi¬ 
ronment  that  promotes  application  programmability  by 
enabling  transparent  access  to  Grid  resources.  Legion 
provides  basic  resource  selection,  work  allocation,  and 
scheduling  mechanisms.  In  order  to  achieve  desired  per¬ 
formance  levels,  applications  (or  their  users)  must  inter¬ 
pret  system  performance  information  in  terms  of  require¬ 
ments  specific  to  the  target  application.  Application  Level 
Scheduling  (AppLeS)  [3]  is  an  established  methodology 
for  developing  adaptive,  distributed  programs  that  execute 
in  dynamically  changing  and  heterogeneous  execution  set¬ 
tings.  The  ultimate  goal  of  this  work  is  to  draw  upon  the 
AppLeS  and  Legion  Computational  Grid  research  efforts  to 
design  an  adaptive  application  scheduler  for  regular  itera¬ 
tive  stencil  codes  in  Legion  environments. 

We consider a general class of regular, data-parallel stencil
codes which require repeated applications of relatively
constant-time operations. Many of these codes have the
following structure:

Initialization 

Loop  over  an  n-dimensional  mesh 

Finalization 

in which the basic activity of the loop is a stencil-based
computation. In other words, the data items in the n-dimensional
mesh are updated based on the values of their nearest neighbors
in the mesh. Such codes are common in scientific computing
and include parallel implementations of matrix operations
as well as routines found in packages such as
ScaLAPACK [18].
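
As an illustration of this structure, the following minimal Python sketch applies a nearest-neighbor stencil to a one-dimensional mesh. The function name `stencil_sweep`, the averaging rule, and the fixed boundary values are assumptions chosen for the example; they are not taken from PMHD3D or ScaLAPACK.

```python
def stencil_sweep(mesh, iterations):
    """Repeatedly replace each interior point by the mean of its two neighbors.

    Illustrative only: a 1-D averaging stencil with fixed boundary values,
    not the PMHD3D update rule.
    """
    current = list(mesh)                      # initialization
    for _ in range(iterations):               # loop over the mesh
        updated = list(current)
        for i in range(1, len(current) - 1):
            # Each point depends only on its nearest neighbors.
            updated[i] = 0.5 * (current[i - 1] + current[i + 1])
        current = updated
    return current                            # finalization


if __name__ == "__main__":
    print(stencil_sweep([0.0, 0.0, 0.0, 0.0, 8.0], iterations=2))
```

Because every update reads only neighboring points, such a mesh can be cut into contiguous pieces whose owners exchange only their boundary values in each iteration.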

In  this  paper  we  focus  on  the  development  of  an  adaptive 
strategy  for  scheduling  a  regular,  data-parallel  stencil  code 
called  PMHD3D  on  the  Legion  Grid  infrastructure.  The 
primary  contributions  of  this  paper  are: 

•  We  describe  an  adaptive  performance  model  for 
PMHD3D  and  demonstrate  its  ability  to  predict  appli¬ 
cation  performance  in  initial  experiments.  The  perfor¬ 
mance  model  represents  the  application’s  requirements 
for  computation,  communication,  overhead,  and  mem¬ 
ory,  and  could  easily  be  extended  to  serve  more  gener¬ 
ally  as  a  framework  for  regular  iterative  stencil  codes 
in  Grid  environments. 

•  We  couple  the  PMHD3D  performance  model  with  re¬ 
source  selection  strategies,  schedule  selection  policies, 
and  deployment  software  to  form  an  AppLeS  sched¬ 
uler  for  PMHD3D. 

•  In order to satisfy the requirements of the PMHD3D
performance model we implement and utilize a new
memory sensor as part of the Network Weather Service
(NWS) [22]. The sensor collects measurements
and produces forecasts of the amount of free memory
available on a processor.

•  We  demonstrate  the  ability  of  the  AppLeS  method¬ 
ology  to  provide  enhanced  performance  for  the 
PMHD3D  application,  using  the  Legion  software  in¬ 
frastructure  as  a  platform  for  high-performance  appli¬ 
cation  execution. 

In  the  next  section  we  discuss  the  structure  of  the  target 
application  and  the  environment  that  we  used  as  a  test-bed. 
In  Section  3,  we  discuss  the  AppLeS  we  have  designed  for 
PMHD3D  and  provide  a  generalizable  performance  model. 
Section  4  provides  experimental  results  and  demonstrates 
performance  improvements  we  achieved  via  AppLeS  using 
Legion.  In  Sections  5  and  6  we  review  related  work  and 
investigate  possible  new  directions,  respectively. 


2.  Research  Components:  AppLeS,  NWS, 
PMHD3D  and  Legion 

In  order  to  build  a  high-performance  scheduler  for 
PMHD3D  we  leveraged  application  characteristics,  dy¬ 
namic  resource  information  from  NWS,  the  AppLeS 
methodology,  and  the  Legion  system  infrastructure.  In  this 
section  we  explain  each  of  these  components  in  detail. 

2.1.  AppLeS 

The  AppLeS  project  focuses  on  the  development  of  a 
methodology  and  software  for  achieving  application  per¬ 
formance  via  adaptive  scheduling  [1].  For  individual  ap¬ 
plications,  an  AppLeS  is  an  agent  that  integrates  with  the 
application  and  uses  dynamic  and  application-specific  in¬ 
formation  to  develop  and  deploy  a  customized  adaptive  ap¬ 
plication  schedule.  For  structurally  similar  classes  of  appli¬ 
cations,  an  AppLeS  template  provides  a  “pluggable”  frame¬ 
work  which  comprises  a  class-specific  performance  model, 
scheduling  model,  and  deployment  module.  An  applica¬ 
tion  from  the  class  can  be  instantiated  within  the  template 
to  form  a  performance-oriented  self-scheduling  application 
targeted  to  the  underlying  Grid  resources. 

AppLeS  schedulers  often  rely  on  available  tools  in  order 
to  deploy  the  schedule  or  to  gather  information  on  resources 
or  environment.  AppLeS  commonly  depends  on  the  Net¬ 
work  Weather  Service  (NWS)  (see  Section  2.4)  to  provide 
dynamic  predictions  of  resource  load  and  availability.  To¬ 
gether,  AppLeS  and  the  Network  Weather  Service  can  be 
used  to  adapt  application  performance  to  the  deliverable  ca¬ 
pacities  of  Grid  resources  at  execution  time.  In  this  project 
AppLeS  uses  Legion  to  execute  a  schedule  and  the  Internet 
Backplane  Protocol  (IBP)  [13]  to  effectively  cache  the  data 
coming  from  NWS. 

2.2.  PMHD3D 

The  target  application  for  this  work,  PMHD3D  [12,  15], 
is  a  magnetohydrodynamics  simulation  developed  at  the 
University  of  Virginia  Department  of  Astronomy  by  John 
F.  Hawley  and  ported  to  Legion  by  Greg  Lindhal.  The  code 
is  an  MPI  FORTRAN  stencil-based  application  and  shares 
many  characteristics  with  other  stencil  codes.  The  code  is 
structured  as  a  three-dimensional  mesh  of  data,  upon  which 
the  same  computation  is  iteratively  performed  on  each  point 
using  data  from  its  neighbors.  PMHD3D  alternates  between 
CPU-intensive  computation  and  communication  (between 
“slab”  neighbors  and  for  barrier  synchronizations). 

At startup PMHD3D reads a configuration file that specifies
the problem size and the target number of processors.
Since the other two dimensions are fixed in PMHD3D’s
three-dimensional mesh, we refer to the height of the mesh
as the problem size. In order to allocate work among processors
in the computation the mesh is divided into horizontal
slabs such that each processor receives a slab. For load
balancing purposes each processor can be assigned a different
amount of work (by dividing the work into slabs of varying
height). The AppLeS scheduler determines the optimal
height of each slab depending on the raw speed of the
processor and on NWS forecasts of CPU load, the amount of
free memory, and network conditions. AppLeS is dynamic
in the sense that the data used by the scheduler is computed
and collected just before execution, but once the schedule is
created and implemented, the execution currently proceeds
without interaction with the AppLeS.

Figure 1. PMHD3D run-time scenarios with and without AppLeS.

2.3.  Legion 

Legion,  a  project  at  the  University  of  Virginia,  is  de¬ 
signed  to  provide  users  with  a  transparent,  secure,  and  re¬ 
liable  interface  to  resources  in  a  wide-area  system,  both  at 
the  programming  interface  level  as  well  as  at  the  end-user 
level [9, 14]. Both the programmer and the end-user have
coherent  and  seamless  access  to  all  the  resources  and  ser¬ 
vices  managed  by  Legion.  Legion  addresses  challenging 
issues  in  Computational  Grid  research  such  as  parallelism, 
fault-tolerance,  security,  autonomy,  heterogeneity,  legacy 
code  management,  resource  management,  and  access  trans¬ 
parency. 

Legion  provides  mechanisms  and  facilities,  leaving  to 
the  programmer  the  implementation  of  the  policies  to  be 
enforced for a particular task. Following this idea, scheduling
in Legion is flexible and can be tailored to suit applications
with different requirements. The main Legion components
involved in scheduling are the collection, the enactor,
the scheduler, and the hosts which will execute the schedule
[5]. The collection provides information about the available
resources and the scheduler selects the resources to be
used in a schedule. The schedule is then given to the enactor,
which contacts the host objects involved in the schedule
and attempts to execute the application. This scheme
provides scheduling flexibility: for example, in case of host
failures the enactor can ask the scheduler for a new schedule
and continue despite the failure; the collection can return
subsets of the resources depending on the user and/or
the application; or the hosts can refuse to serve a specific
user.

Legion  currently  provides  default  implementations  of  all 
the  objects  described  herein.  Moreover,  new  objects  can 
be developed and used rather than the default ones. Note
that the PMHD3D AppLeS is developed “on top” of Legion,
and uses default Legion objects. We therefore expect
the performance improvement reported here to be a conservative
lower bound on what would be achievable if
the AppLeS were structured as a Legion object. We plan
to eventually develop the AppLeS described here as a Legion
scheduling object for a class of regular, iterative,
data-parallel applications.

2.4.  Network  Weather  Service 

The Network Weather Service [17, 22] is a distributed
system that periodically monitors and dynamically forecasts
the performance that various network and computational
resources can deliver. NWS is composed of sensors, memories,
and forecasters. Sensors measure the availability of
the resource, for example CPU availability, and then record
the measurement in an NWS memory. In response to a query,
the  NWS  software  will  return  a  time  series  of  measurements 
from  any  activated  sensor  in  the  system.  This  time  series 
can  then  be  passed  to  the  NWS  forecaster  which  predicts 
the  future  availability  of  the  resource.  The  forecaster  tests 
a  variety  of  predictors  and  returns  the  result  and  expected 
error  of  the  most  accurate  predictor.  To  obtain  better  per¬ 
formance  for  PMHD3D  we  developed  a  memory  sensor 
that  measures  the  available  free  memory  of  a  machine.  The 
sensor  has  been  extended  and  is  now  part  of  NWS. 
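
The forecaster's behavior of testing several predictors and returning the most accurate one can be sketched as follows. This is a minimal Python illustration, not the NWS implementation: the three predictors, the mean-absolute-error criterion, and all function names are assumptions made for the example.

```python
def last_value(history):
    """Predict that the next measurement equals the most recent one."""
    return history[-1]


def running_mean(history):
    """Predict the mean of all measurements seen so far."""
    return sum(history) / len(history)


def median_of_window(history, window=3):
    """Predict the (upper) median of the most recent `window` measurements."""
    recent = sorted(history[-window:])
    return recent[len(recent) // 2]


# Hypothetical predictor set; the predictors used by NWS are not listed here.
PREDICTORS = {
    "last_value": last_value,
    "running_mean": running_mean,
    "median": median_of_window,
}


def forecast(series):
    """Return (name, prediction, mean_abs_error) of the best past performer."""
    if len(series) < 2:
        raise ValueError("need at least two measurements")
    best = None
    for name, predictor in PREDICTORS.items():
        # Replay the series: predict each measurement from the ones before it.
        errors = [abs(predictor(series[:t]) - series[t])
                  for t in range(1, len(series))]
        mae = sum(errors) / len(errors)
        if best is None or mae < best[2]:
            best = (name, predictor(series), mae)
    return best
```

A scheduler would call `forecast` on the time series returned for one sensor (for example free memory on one host) and use both the prediction and its expected error.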

2.5.  Interactions  Among  System  Components 

PMHD3D  can  directly  access  Legion’s  scheduling  fa¬ 
cilities  or  can  use  AppLeS  to  obtain  a  more  performance- 
efficient  schedule.  Figure  1  shows  the  interactions  among 
components  in  each  of  these  scenarios.  The  dotted  line  rep¬ 
resents  the  scheduling  of  a  PMHD3D  run  without  AppLeS 
facilities: the user supplies the number of processors, the
processor list, and the associated problem size per processor,
and the rest of the scheduling process is supplied by a
default scheduler within the Legion infrastructure.

When  the  application  uses  AppLeS  for  scheduling,  the 
interactions  among  components  can  instead  be  represented 
by  the  solid  lines  in  Figure  1.  In  this  case  the  user  sup¬ 
plies  only  the  problem  size  of  interest.  AppLeS  collects 
the  list  of  available  resources  from  the  environment  (via  the 
Legion  collection  object  or,  in  our  case,  via  the  Legion  con¬ 
text  space),  and  then  queries  NWS  to  obtain  updated  per¬ 
formance  and  availability  predictions  for  the  available  re¬ 
sources.  As  the  figure  shows,  AppLeS  collects  the  NWS 
predictions  as  an  IBP  client:  the  predictions  are  pushed  into 
the  IBP  server  by  a  separate  process. 

AppLeS  then  creates  a  performance-promoting  adaptive 
schedule  and  asks  the  Legion  scheduler  to  execute  it.  The 
schedule is adaptive because AppLeS assigns a different
amount of work to each processor depending on its predicted
performance. As is suggested by the figure, the
PMHD3D  AppLeS  is  built  on  top  of  Legion  facilities.  A  fu¬ 
ture  goal  is  to  integrate  the  AppLeS  as  an  alternative  sched¬ 
uler  in  Legion  for  the  class  of  regular,  data-parallel,  stencil 
applications. 

3.  The  PMHD3D  AppLeS 

The general AppLeS approach is to create good schedules
for an application by incorporating application-specific
characteristics, system characteristics, and dynamic
resource performance data in scheduling decisions. The
PMHD3D AppLeS draws upon the general AppLeS
methodology [3] and the experience gained building an
AppLeS for a structurally similar Jacobi-2D application [2].

Conceptually,  the  PMHD3D  AppLeS  can  be  decom¬ 
posed  into  three  components: 

•  a  performance  model  that  accurately  represents  ap¬ 
plication  performance  within  the  Computational  Grid 
environment; 

•  a  resource  selection  strategy  that  identifies  poten¬ 
tially  performance-efficient  candidate  resource  sets 
from  those  that  are  available  at  run  time; 

•  a  schedule  creation  and  selection  strategy  that  cre¬ 
ates  a  good  schedule  for  each  of  the  various  candidate 
resource  sets  and  then  selects  the  most  performance- 
efficient  schedule. 

The  overall  strategy  and  organization  of  the  scheduler 
will  be  discussed  here  but  the  details  of  each  component  are 
reserved  for  the  following  sections. 

An  accurate  performance  model  (Section  3.1)  is  funda¬ 
mental  for  the  development  of  good  schedules.  The  per¬ 
formance  model  is  used  in  two  important  ways,  the  first  of 
which is to guide the creation of schedules for specific
resource sets. For example, load balancing is a necessary
condition for developing an efficient schedule but is difficult or
impossible to achieve without an estimate of the relative costs
of  computation  on  various  resources.  An  accurate  perfor¬ 
mance  model  is  also  necessary  for  selection  of  the  highest 
performance  schedule  from  a  set  of  candidate  schedules. 

The  resource  selection  strategy  (Section  3.2)  produces 
several  orderings  of  available  resources  based  on  differ¬ 
ent  concepts  of  “desirability”  of  resources  to  PMHD3D. 
Our  definitions  of  desirability  incorporate  Legion  re¬ 
source  discovery  results,  dynamic  resource  availability  from 
NWS,  dynamic  performance  forecasts  from  NWS,  and 
application-specific  performance  data  for  each  resource. 
Once  complete,  the  ordered  lists  of  resources  are  passed  on 
to  the  schedule  creation  and  selection  component  of  the  Ap¬ 
pLeS. 

The  schedule  creation  step  (Section  3.3)  takes  the  pro¬ 
posed  resource  lists  and  creates  a  good  schedule  for  each 
based  on  the  constraints  the  system  and  application  im¬ 
pose.  System  constraints  are  characteristics  such  as  avail¬ 
able  memory  of  the  resources  while  the  application  con¬ 
straints  are  characteristics  such  as  the  amount  of  memory  re¬ 
quired  for  the  application  to  remain  in  main  memory.  Once 
all  schedules  have  been  created  the  performance  model  is 
used  to  select  the  highest  performance  schedule  (the  one  in 
which  the  execution  time  is  expected  to  be  the  lowest). 

The decomposition of the scheduling process into these
disjoint steps provides an overly simplistic view of the
interactions between steps. In reality the scheduling process
requires more complicated interactions. To accurately
represent the true interaction of the scheduling components we
present a pseudo-code version of the PMHD3D AppLeS
strategy in Figure 2. The steps shown in Figure 2 will
become clearer in the following sections.

Rset = getResourceSet()                      \\ Available resources obtained from Legion
NWS_data = NWS(Rset)                         \\ NWS forecasts of resource performance
C = getScheduleConstraints()                 \\ Obtain scheduling constraints for simplex
for (balance = {0, 0.5, 1})                  \\ Select for CPU power, connectivity, both
    S = sort(Rset, balance, maxP)            \\ Returns list of hosts sorted by desirability
    for (n = 2..maxP)                        \\ Searching for correct number of processors
        sched = findSched(n, S, NWS_data, C) \\ Use simplex to find schedule on S using C
        while (sched is not found)           \\ Simplex was unsolvable with S and C
            "Schedule constraints are too restrictive"
            relaxConstraints(C)              \\ More schedule flexibility, more possible error
            sched = findSched(n, S, NWS_data, C) \\ Try to find schedule again
        endwhile                             \\ Found a feasible schedule
        if (cost(sched) < cost(best))        \\ If best one so far keep it, else throw away
            best = sched
        endif
    endfor
endfor
run(best)                                    \\ Best schedule found, run it

Figure 2. PMHD3D AppLeS pseudo-code.

3.1.  Performance  Model 


The  goal  of  the  performance  model  is  to  accurately  pre¬ 
dict  the  execution  time  of  PMHD3D.  Since  the  run-time 
may  vary  somewhat  from  processor  to  processor,  we  take 
the  maximum  run-time  of  any  processor  involved  in  the 
computation  as  the  overall  run-time.  During  every  iteration 
each  processor  computes  on  its  slab  of  data,  communicates 
with  its  neighbors,  and  synchronizes  with  all  other  proces¬ 
sors. 

Formally, the running time for processor i is given by:

$$T_i = Comp_i + Comm_i + Over_i$$

where $Comp_i$, $Comm_i$, and $Over_i$ are the predicted
computation time, the predicted communication time, and the
estimated overhead for $P_i$, respectively.

Computation time is directly related to the units of work
assigned to a processor (in other words the height of the
slab) and to the speed of that processor. The computation
time for $P_i$ is:

$$Comp_i = \frac{x_i \cdot BM_i}{Avail_i}$$

where $x_i$ is the amount of work allocated to processor $P_i$
(dynamically determined by the scheduling process), $BM_i$
is a benchmark for the application-specific speed of $P_i$'s
processor configuration, and $Avail_i$ is a forecast of the CPU
availability on processor $P_i$ given its current load (obtained
from dynamic NWS forecasts). To obtain the benchmarks, we ran
PMHD3D on dedicated machines with various problem sizes and
varying numbers of hosts. Execution times were proportional
to problem size and are given in terms of seconds per point
on each platform.

Communication  time  is  modeled  as  the  time  required 
for  transferring  data  to  neighboring  processors  across  the 
available  network.  This  represents  communication  for  all 
iterations  and  accounts  for  both  the  time  to  establish  a  con¬ 
nection  and  the  time  to  transfer  the  messages.  To  simplify 
the  communication  model,  we  have  not  attempted  to  di¬ 
rectly  predict  synchronization  time  or  the  time  a  processor 
waits  for  a  communication  partner.  We  hope  instead  to  cap¬ 
ture  the  effect  of  these  communication  costs  in  our  estimate 
of  overhead  costs,  which  we  discuss  shortly.  Communica¬ 
tion  time  is  then: 

$$Comm_i = \frac{MB}{b_{i,i+1} + b_{i,i-1}} + M \cdot (l_{i,i+1} + l_{i,i-1})$$

where $MB$ is the total megabytes transferred, $M$ is the number
of messages transferred, and $b_{i,j}$ and $l_{i,j}$ are predictions
of available bandwidth and latency from $P_i$ to $P_j$, respectively.
Predictions of available bandwidth and latency between
pairs of processors are obtained from dynamic NWS
forecasts. To provide an estimate of the number of messages
transferred ($M$) and the megabytes transferred ($MB$)
we  examined  post-execution  program  performance  reports 
provided  by  Legion.  For  a  variety  of  problem  sizes  and  re¬ 
source  set  sizes  the  number  of  megabytes  transferred  var¬ 
ied  by  less  than  5%  so  we  used  an  average  value  for  all 
runs.  Data  transfer  does  not  significantly  vary  with  prob¬ 
lem  size  because  the  problem  size  affects  only  the  height  of 
the  grid  while  the  decomposition  is  performed  horizontally. 
Data  transfer  costs  also  do  not  vary  with  number  of  proces¬ 
sors  because  each  processor  must  communicate  with  only 
its  neighbors,  regardless  of  the  total  number  of  processors. 
Although the number of messages transferred varied more
significantly from run to run, we also used an average value
for  this  variable.  This  approximation  did  not  adversely  af¬ 
fect  our  scheduling  ability  in  the  environments  we  tested;  in 
cases  where  communication  costs  are  more  severe  a  model 
could  be  developed  to  approximate  the  expected  number  of 
messages  transferred. 

The overhead factor $Over_i$ is included in the performance
model to capture application and system behavior
that cannot be accounted for by a simple
communication/computation model. For example, a processor
will  likely  spend  time  synchronizing  with  other  processors, 
waiting  for  neighbor  processors  for  data  communication, 
and  waiting  for  system  delays.  System  overheads  are  as¬ 
sociated  with  specifics  of  the  hardware  and  Legion  infras¬ 
tructure  such  as  the  time  required  to  resolve  the  physical 
location  of  a  data  object  needed  by  the  application.  The 
overhead  for  PMHD3D  can  be  estimated  by: 

$$Over_i = 16 - 1.5 \cdot probSize/1000 + 0.094 P^2$$

where  P  is  the  number  of  processors  involved  in  the  com¬ 
putation  and  probSize  is  the  height  of  the  PMHD3D  mesh. 

$Over_i$ was estimated empirically using data from 106
individual application executions with problem sizes varying
from 1000 to 6000 and with resource set sizes varying
between 4 and 26. To determine the effect of the number
of processors on overhead, runs were grouped by
problem size and the corresponding execution times plotted
against number of processors. For each set of runs performed
with the same problem size, a quadratic fit was performed
on the difference between the actual execution time
and the predicted execution time (without the overhead factor).
The quadratic factor varied between 0.090 and 0.096
with a mean of 0.094 (standard deviation of 0.0022). To
determine the effect of problem size on overhead we used
the same runs but performed a linear fit of the predicted/actual
execution time difference against problem size.
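
The three terms of the model can be combined into a single run-time predictor, sketched below in Python under stated assumptions. The communication term follows our reading of the formula in the text (total megabytes divided by the summed bandwidth to the two slab neighbors, plus per-message latency to both neighbors). The function names, the dictionary keys, and the units (megabytes, megabytes per second, seconds) are hypothetical and chosen only for illustration.

```python
def comp_time(x, bm, avail):
    """Comp_i = x_i * BM_i / Avail_i (bm in seconds per unit of work)."""
    return x * bm / avail


def comm_time(mb, messages, bw_next, bw_prev, lat_next, lat_prev):
    """Comm_i: data volume over summed neighbor bandwidth, plus message latency.

    Assumed units: mb in MB, bandwidths in MB/s, latencies in seconds.
    """
    return mb / (bw_next + bw_prev) + messages * (lat_next + lat_prev)


def overhead(prob_size, num_procs):
    """Over_i = 16 - 1.5 * probSize / 1000 + 0.094 * P^2 (empirical fit)."""
    return 16 - 1.5 * prob_size / 1000 + 0.094 * num_procs ** 2


def predicted_runtime(procs, prob_size, mb, messages):
    """Overall run-time: the maximum T_i over all processors in the schedule.

    `procs` is a list of dicts with hypothetical keys
    x, bm, avail, bw_next, bw_prev, lat_next, lat_prev.
    """
    over = overhead(prob_size, len(procs))
    return max(
        comp_time(p["x"], p["bm"], p["avail"])
        + comm_time(mb, messages, p["bw_next"], p["bw_prev"],
                    p["lat_next"], p["lat_prev"])
        + over
        for p in procs
    )
```

Taking the maximum over processors mirrors the definition of overall run-time given at the start of this section, so a single slow or heavily loaded host dominates the prediction.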

3.2.  Resource  Selection 

Resource  selection  is  the  process  of  selecting  a  set 
of  target  resources  (processors  in  this  case)  that  will  be 
performance-efficient.  Finding  the  optimal  set  of  resources 
requires  comparing  all  possible  schedules  on  all  possible 
subsets of the resource pool, which is clearly an inefficient
process as the resource pool becomes large. Instead, we create
several  ordered  lists  of  resources  by  employing  a  heuristic 
to  sort  candidate  resources  in  terms  of  several  definitions 
of  resource  desirability.  Resource  desirability  is  based  on 
how  resource  characteristics  such  as  computational  speed 
and  network  connectivity  will  affect  the  performance  of 
PMHD3D. 

The resource selection process begins by querying Legion
to discover the available set of resources. Effective
evaluation of the desirability of each resource requires
application-specific  performance  information  as  well  as  dy¬ 
namic  resource  performance  information.  As  of  this  writ¬ 
ing,  Legion  collection  objects  report  available  resources  and 
their  static  configurations  but  do  not  provide  up-to-date  dy¬ 
namic  information  on  availability,  load,  or  connectivity.  Ac¬ 
cordingly,  the  list  of  available  resources  reported  by  Legion 
is  used  to  query  NWS  for  dynamic  forecasts  of  resource 
availability,  CPU  load,  and  free  memory  for  each  host  and 
of  latency  and  bandwidth  between  all  pairs  of  hosts.  To  ob¬ 
tain  the  computational  cost  per  unit  of  the  PMHD3D  grid 
on  each  type  of  resource  we  used  the  benchmarking  method 
described  in  Section  3.1. 

Once  the  available  resource  lists  and  the  dynamic  sys¬ 
tem  characteristics  are  collected,  the  list  can  be  ordered  in 
terms  of  desirability.  We  use  three  definitions  of  desirability 
of  a  resource:  desirability  based  on  connectivity,  desirabil¬ 
ity  based  on  computational  power,  and  desirability  based 
equally  on  the  two  characteristics.  Connectivity  is  approx¬ 
imated  by  computing  the  latency  and  bandwidth  between 
the  resource  in  question  and  all  other  resources  in  the  re¬ 
source  pool:  as  a  metric  we  calculate  the  amount  of  time 
(seconds)  it  would  take  for  the  resource  in  question  to  ex¬ 
change  a  packet  of  size  1  byte  to  and  from  every  other  host. 
Computational  power  is  measured  by  the  time  (seconds)  it 
would  take  the  host  to  compute  1  point  for  1  iteration  based 
on  the  NWS  predictions  and  the  benchmarks  we  discussed 
earlier.  The  balanced  strategy  orders  the  resources  based  on 
an  average  of  computational  power  and  connectivity. 

The  resource  set  is  sorted  into  3  resource  lists  using  the 
3  notions  of  resource  desirability.  We  then  create  subsets  of 
the  lists  by  selecting  the  n  most  desirable  hosts  from  each 
list where n = 2, ..., maxP and n is even. We select multiple
subsets from each list because it is often impossible to know
the optimal number of hosts a priori. Once the subsets have
been created the resulting group of proposed resource sets
is passed on to the schedule creation step described in the
next  section.  Although  the  approach  described  here  is  not 
guaranteed  to  find  the  optimal  resource  set,  the  methodol¬ 
ogy  provides  a  scalable  and  performance-efficient  approach 
to  resource  selection. 
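
The ordering and subset-generation steps described in this section can be sketched as follows. This is a hedged Python illustration rather than the scheduler's actual code: the function names, the dictionary-based inputs, the bandwidth unit (bytes per second), and the use of a weight in {0, 0.5, 1} to blend the two cost metrics are all assumptions made for the example.

```python
def connectivity_cost(host, hosts, latency, bandwidth, nbytes=1):
    """Seconds for `host` to send an nbytes packet to, and receive one from,
    every other host (bandwidth assumed to be in bytes per second)."""
    total = 0.0
    for other in hosts:
        if other != host:
            total += latency[(host, other)] + nbytes / bandwidth[(host, other)]
            total += latency[(other, host)] + nbytes / bandwidth[(other, host)]
    return total


def compute_cost(host, sec_per_point, avail):
    """Seconds to compute one point for one iteration at forecast availability."""
    return sec_per_point[host] / avail[host]


def candidate_sets(hosts, latency, bandwidth, sec_per_point, avail, max_p):
    """Return the n most desirable hosts (n even, 2..max_p) for each ordering."""
    limit = min(max_p, len(hosts))
    proposals = []
    # weight 0.0: computational power only; 1.0: connectivity only; 0.5: both.
    for weight in (0.0, 0.5, 1.0):
        def score(h, w=weight):
            return (w * connectivity_cost(h, hosts, latency, bandwidth)
                    + (1.0 - w) * compute_cost(h, sec_per_point, avail))
        ordered = sorted(hosts, key=score)   # lower cost means more desirable
        for n in range(2, limit + 1, 2):
            proposals.append(ordered[:n])
    return proposals
```

With three orderings and only even subset sizes, the number of proposed resource sets grows linearly with maxP, in contrast to the exponential number of subsets an exhaustive search would examine.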

3.3.  Schedule  Creation  and  Selection 

For each of the proposed resource sets, a schedule is
developed. Essentially, schedule development on a given
resource set for PMHD3D reduces to finding a work allocation
that provides good time balancing. As in Section 3.1,
work allocation is represented by $x_i$ and is the height of the
slab given to processor $P_i$.

One of the most important characteristics for any solution
to this problem is time balancing: all processors should
finish at the same time. Using the notation from Section 3.1,
$T_i = T_{i+1}$ for $i \in \{1, \ldots, n-1\}$ and, since all of the
work must be allocated, we also have $\sum_{i=1}^{n} x_i = probSize$.
Taken together we have n equations in n unknowns and the
problem can be solved with a basic linear solver. This
approach was successful for the Jacobi-2D AppLeS [2] but is
not powerful enough to incorporate several additional
constraints required to develop good schedules for PMHD3D.
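
The time-balancing system can be illustrated with a short Python sketch. It assumes that each processor's run-time has the form T_i = a_i * x_i + k_i, where a_i (seconds per unit of work, i.e. BM_i / Avail_i) and k_i (communication plus overhead, which do not depend on x_i in the model of Section 3.1) are symbols introduced here for illustration. Under that assumption the n equations have a closed-form solution, which the sketch uses in place of a general linear solver.

```python
def balance_slabs(prob_size, sec_per_unit, fixed_cost):
    """Solve T_i = T_{i+1} and sum(x_i) = prob_size for T_i = a_i * x_i + k_i.

    sec_per_unit: list of a_i values; fixed_cost: list of k_i values.
    Returns the common finish time T and the slab heights x_i.
    """
    # From x_i = (T - k_i) / a_i and sum(x_i) = prob_size:
    #   T = (prob_size + sum(k_i / a_i)) / sum(1 / a_i)
    inv_sum = sum(1.0 / a for a in sec_per_unit)
    weighted = sum(k / a for k, a in zip(fixed_cost, sec_per_unit))
    finish = (prob_size + weighted) / inv_sum
    heights = [(finish - k) / a for k, a in zip(fixed_cost, sec_per_unit)]
    return finish, heights


if __name__ == "__main__":
    # A processor twice as slow per unit receives half as much work.
    print(balance_slabs(300, [1.0, 2.0], [0.0, 0.0]))
```

The heights returned are real numbers and are not checked against any per-host limit, which is one reason a plain linear solve is insufficient once further constraints are introduced.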

One  of  the  important  constraints  for  PMHD3D  perfor¬ 
mance  is  the  amount  of  memory  available  for  the  applica¬ 
tion.  There  is  a  limit  to  the  size  of  problem  that  can  be 
placed  on  a  machine  because  if  the  computation  spills  out 
of  memory,  performance  can  drop  by  two  orders  of  magni¬ 
tude.  To  quantify  this  constraint  a  benchmark  for  applica¬ 
tion  memory  usage  must  be  obtained  by  observing  memory 
usage  for  varying  problem  sizes  on  each  type  of  resource. 
Formally,  this  constraint  becomes: 

$$BMmem_i \cdot x_i < MemAvail_i$$

where $MemAvail_i$ is the available memory for processor $i$ (provided
by the NWS memory sensor) and $BMmem_i$ is the memory benchmark
(megabytes/unit) recorded for processor $i$'s architecture.
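As a small worked illustration, the memory constraint can be turned into an upper bound on the number of work units a processor may receive. The function name and the numbers in the usage line are hypothetical; only the inequality itself comes from the text.

```python
import math


def max_units_in_memory(mem_avail_mb, bm_mem_mb_per_unit):
    """Largest whole number of units x with bm_mem * x < mem_avail."""
    limit = mem_avail_mb / bm_mem_mb_per_unit
    units = math.ceil(limit) - 1  # strict inequality: stay below the limit
    return max(units, 0)


# Hypothetical values: 100 MB available, 0.25 MB per unit of work.
cap = max_units_in_memory(100, 0.25)
```

With these made-up inputs the bound is 399 units, since 400 units would use exactly 100 MB and violate the strict inequality.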

We formalize the work allocation constraints as a Linear Programming
problem (from now on simply LP), solvable with the simplex method [6].
In short, LP solves the problem of finding an extremum (maximum or
minimum) of a function $f(x_1, x_2, \ldots, x_n)$ where the unknowns
have to satisfy a set of constraints $g(x_1, x_2, \ldots, x_n) \geq b$
and both the objective function and the constraints are linear. The
simplex is a well-known method used to solve LP problems. The simplex
formulation requires that constraints are expressed in standard form:
that is, the constraints must be expressed as equalities and each
variable is assigned a non-negativity sign restriction. There is a
simple procedure that can be used to transform LP problems into a
standard form equivalent.
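The usual textbook procedure for reaching standard form adds one slack variable to each "less than or equal" row and subtracts one surplus variable from each "greater than or equal" row. The sketch below shows that step only; the function name and the row representation are assumptions made for illustration, and the variables are assumed to be non-negative already.

```python
def to_standard_form(A, b, senses):
    """Turn rows  A[k] . x  (<= | >= | ==)  b[k]  into equalities.

    Adds one non-negative slack (for <=) or surplus (for >=) column per
    inequality row. Returns (A_eq, b_eq) with the extra columns appended
    on the right of the original columns.
    """
    n_extra = sum(1 for s in senses if s != "==")
    A_eq, col = [], 0
    for row, s in zip(A, senses):
        extra = [0.0] * n_extra
        if s == "<=":
            extra[col] = 1.0   # row + slack == b
            col += 1
        elif s == ">=":
            extra[col] = -1.0  # row - surplus == b
            col += 1
        elif s != "==":
            raise ValueError(f"unknown constraint sense: {s!r}")
        A_eq.append([float(v) for v in row] + extra)
    return A_eq, [float(v) for v in b]


# Hypothetical rows: x1 + x2 == 1000 (all work allocated) and
# 0.25 * x1 <= 100 (a memory limit on processor 1).
A_eq, b_eq = to_standard_form([[1, 1], [0.25, 0]], [1000, 100], ["==", "<="])
```

The second row gains a slack column, so it reads 0.25 x1 + s = 100 with s non-negative.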

We  modified  the  time  balancing  equations  to  provide 
some  flexibility  for  the  constraints  specification:  expected 
execution  time  for  any  processor  in  the  computation  must 
fall  within  a  small  percentage  of  the  expected  total  running 
time.  This  flexibility  is  beneficial,  especially  as  additional 
constraints  such  as  memory  limits  are  incorporated  into  the 
problem  formulation.  The  constraints  are  initially  very  rigid 
but  can  be  relaxed  in  cases  where  no  solution  can  be  found 
given  the  initial  constraints.  The  time  balancing  equations 
and  the  application  memory  requirements  form  the  applica¬ 
tion  constraints  on  which  the  simplex  has  to  operate.  The 
simplex formulation also requires specification of an objective
function where the goal of the solver is to maximize the objective
function while satisfying the simplex constraints. We use $\sum_i x_i$
as the objective function and search for a solution where all work is
allocated.
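The relax-and-retry behaviour described in this paragraph can be sketched as a simple loop. Here `solve` stands in for building and solving the LP at a given time-balance tolerance; it is a hypothetical callable, and the tolerance values are assumptions, since the text does not state the percentages actually used.

```python
def solve_with_relaxation(solve, tolerances=(0.01, 0.02, 0.05, 0.10)):
    """Try increasingly loose time-balance tolerances until one is feasible.

    solve(tol) is a caller-supplied function that returns an allocation,
    or None when the solver reports that no solution exists at `tol`.
    """
    for tol in tolerances:
        allocation = solve(tol)
        if allocation is not None:
            return tol, allocation
    raise RuntimeError("no feasible schedule even at the loosest tolerance")


# Stand-in solver for illustration only: pretend the constraints become
# feasible once the tolerance reaches 5%.
def fake_solve(tol):
    return [500, 300, 200] if tol >= 0.05 else None


tol, allocation = solve_with_relaxation(fake_solve)
```

Starting from the tightest tolerance means the first feasible schedule found is also the best balanced one among those tried.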

For each of the proposed resource sets the simplex is used to create
the best schedule possible for that resource set. We use a library [16]
which provides a fast and easy-to-use implementation of the simplex.
There are several benefits of using linear programming and the simplex
method to create a good schedule:

•  Linear  programming  is  well  known  and  commonly 
used  so  that  fast  and  reliable  algorithms  are  readily 
available. 

•  Once  the  constraints  are  formalized  as  a  linear  pro¬ 
gramming  problem,  adding  additional  constraints  is 
trivial.  For  example,  the  FORTRAN  compiler  used  to 
compile  PMHD3D  enforced  a  limit  on  the  maximum 
size  of  arrays,  therefore  limiting  the  maximum  units 
of  work  that  could  be  allocated  to  any  processor.  This 
constraint  was  easily  added  to  the  problem  formaliza¬ 
tion. 

•  The linear programming problem can be extended to give integer
solutions, although the problem then becomes much more difficult.
Currently the solver computes real values for work allocation and we
redistribute the fractional work portions. In some problems an integer
solution may be required for additional accuracy.

•  In  the  case  that  a  solution  cannot  be  found,  the  simplex 
method  provides  important  feedback.  For  this  applica¬ 
tion,  the  simplex  could  not  find  a  solution  if  the  con¬ 
straints  were  too  restrictive.  In  this  case  the  simplex  is 
reiterated  with  successively  relaxed  constraints  until  a 
solution  can  be  reached. 
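The third point above mentions redistributing the fractional work portions of a real-valued solution. The text does not say which rule is used, so the sketch below assumes a largest-remainder rule: round every allocation down, then give the leftover units to the processors with the largest fractional parts. The function name is hypothetical.

```python
import math


def round_allocation(x_real, prob_size):
    """Round fractional slab heights to integers that sum to prob_size."""
    floors = [math.floor(v) for v in x_real]
    leftover = prob_size - sum(floors)
    if not 0 <= leftover <= len(x_real):
        raise ValueError("x_real must sum (approximately) to prob_size")
    # Hand the leftover units to the largest fractional parts first.
    order = sorted(range(len(x_real)),
                   key=lambda i: x_real[i] - floors[i], reverse=True)
    for i in order[:leftover]:
        floors[i] += 1
    return floors


# Hypothetical real-valued solution for three processors.
heights = round_allocation([333.4, 333.3, 333.3], 1000)
```

This simple rule does not re-check per-processor limits after rounding up, so a production version would need to skip processors already at their cap.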

Once  the  proposed  schedules  are  identified,  schedule  se¬ 
lection  is  surprisingly  simple.  The  performance  model  is 
used  to  evaluate  the  expected  execution  time  of  each  pro¬ 
posed  schedule,  and  the  schedule  with  the  lowest  estimated 
execution  time  is  selected  and  implemented. 
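Schedule selection as described here is a minimum over the model's predictions. In the sketch below, `predict_time` stands in for the performance model, and the toy model and candidate schedules are made up purely for illustration.

```python
def select_schedule(schedules, predict_time):
    """Return the candidate schedule with the lowest predicted time.

    schedules: iterable of candidate schedules (any representation)
    predict_time: callable standing in for the performance model
    """
    return min(schedules, key=predict_time)


# Toy usage: a schedule maps host name -> units of work. The stand-in model
# charges 0.02 s per unit on the most heavily loaded host plus 0.5 s of
# overhead per host (both figures are invented).
def toy_model(schedule):
    return 0.02 * max(schedule.values()) + 0.5 * len(schedule)


candidates = [
    {"h1": 500, "h2": 500},
    {"h1": 250, "h2": 250, "h3": 250, "h4": 250},
]
best = select_schedule(candidates, toy_model)
```

Under the toy model the four-host schedule is chosen, because the reduced work per host outweighs the added per-host overhead.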

4.  Results 

The  PMHD3D  AppLeS  has  been  implemented  and  we 
present  results  to  investigate  the  usefulness  of  the  method¬ 
ology.  The  goals  of  these  experiments  were  to: 

•  Evaluate  the  accuracy  of  our  performance  prediction 
model. 

•  Evaluate  the  ability  of  the  PMHD3D  AppLeS  to  pro¬ 
mote  application  performance  in  a  multi-user  Legion 
environment. 

The previous sections stressed the importance of the performance model
for effective scheduling. In Section 4.2 we explain in detail results
demonstrating the accuracy of the performance model. In Section 4.3 we
present evidence that the scheduling methodology and implementation are
effective in practice. Before discussing these results we first outline
our experimental design.

4.1.  Experimental  Design 

To  evaluate  the  PMHD3D  AppLeS,  we  conducted  ex¬ 
periments  on  the  University  of  Virginia  Centurion  Cluster, 
a  large  cluster  of  machines  maintained  by  the  Legion  team 
(see  [4]  for  more  information  on  the  cluster).  The  Centu¬ 
rion  Cluster  is  continuously  upgraded  for  new  Legion  ver¬ 
sion  releases;  during  the  3-month  period  of  the  experiments, 
we  used  Legion  versions  1.5  through  1.6.1.  The  cluster  it¬ 
self  is  composed  of  128  Alphas  and  128  Dual-Pentium  II 
PCs;  12  fast  Ethernet  switches  and  a  gigaswitch  connect 
the  whole  cluster.  Although  we  employed  both  Alphas  and 
Pentiums  during  the  development  and  initial  testing  process, 
we  had  multiple  difficulties  with  Alpha  Linux  kernel  insta¬ 
bilities  and  a  faulty  network  driver  which  made  our  data 
for the Alpha machines unreliable. The results presented
here are based only on the 400 MHz Dual Pentium II machines. We did not
employ the second processor on the Dual Pentium machines; therefore,
when we talk about a host or machine, we consider the machines to be
uniprocessors. It is worth noting that many users only use one
processor per node so that
even  a  computationally  intensive  user  will  not  affect  CPU 
availability  as  much  as  might  be  expected.  However,  the 
two  processors  on  each  Dual  Pentium  machine  utilize  the 
same  memory,  sometimes  leading  to  performance  degrada¬ 
tion  due  to  overloaded  memory  systems.  Inclusion  of  mem¬ 
ory  constraints  in  the  performance  model  helped  the  Ap¬ 
pLeS  scheduler  avoid  overloaded  memory  systems. 

We  restricted  our  experiments  to  34  machines  for  practi¬ 
cal  reasons:  the  dynamic  information  collected  from  NWS 
includes  a  large  amount  of  data,  even  for  a  relatively  small 
cluster.  Limiting  the  resource  pool  did  not  impact  inves¬ 
tigations  of  application  performance  or  schedule  efficiency 
because,  as  will  become  clear,  the  parallelism  available  in 
PMHD3D  for  the  problem  sizes  studied  here  is  well  below 
the  34  machine  limit.  As  explained  in  Section  2.5  we  used 
an  IBP  server  running  at  all  times  at  UCSD,  while  AppLeS 
acted  as  an  IBP  client  retrieving  the  forecasts.  This  setup 
allowed  us  to  obtain  updated  predictions  for  a  large  number 
of  resources  in  a  reasonable  amount  of  time.  On  average  it 
took  less  than  4  seconds  to  retrieve  the  data,  with  a  mini¬ 
mum  of  2.5  seconds  and  a  maximum  of  8.5  seconds. 

To test the performance of PMHD3D under a variety of conditions,
experiments were typically performed with maximum resource set sizes
(from now on called resource pool or simply pool) of 4, 6, ..., 26 and
problem sizes of 1000, 2000, ..., 6000. Problem size is the height of
the data grid used by PMHD3D. The pool is the maximum number of
machines the scheduler is allowed to employ. We
test  varying  pool  sizes  to  simulate  conditions  under  which 
a  user  may  be  limited  to  a  certain  number  of  resources  by 
cost  or  access  considerations.  Although  our  overall  resource 
pool  contains  34  machines  in  total,  the  maximum  pool  size 
we  simulate  is  only  26.  This  choice  was  practical:  we  fre¬ 
quently  found  unavailable  or  inaccessible  machines  in  our 
overall  resource  pool  and  so  were  never  able  to  access  all 
34  machines  at  one  time.  Note  also  that  the  scheduler  may 
determine  that  utilizing  the  entire  pool  is  not  the  most  per¬ 
formance  efficient  choice.  In  this  case  the  pool  is  larger  than 
the  number  of  target  resources. 

The  experiments  presented  in  Section  4.2  were  con¬ 
ducted  under  unloaded  conditions  while  those  presented  in 
Section  4.3  were  conducted  under  loaded  conditions.  The 
ambient  load  present  during  most  of  our  loaded  runs  con¬ 
sisted  of  heavy  use  of  some  machines  and  light  use  of  oth¬ 
ers.  In  order  to  investigate  application  performance  we 
report  performance  results  based  on  application  execution 
time.  However,  there  is  a  cost  associated  with  using  Ap¬ 
pLeS  to  develop  a  schedule.  We  analyzed  43  runs  in  detail 
and  the  dominant  scheduling  cost  is  associated  with  query¬ 
ing  the  Legion  Collection  and  the  Legion  context  space. 
The  time  required  to  access  NWS  and  IBP  is  on  average  less 
than  4  seconds.  Once  the  system  and  performance  informa¬ 
tion  has  been  collected,  the  AppLeS  required  on  average 
roughly  1  second  to  order  the  resources,  create  schedules, 
and  select  the  best  schedule. 

4.2.  Performance  Model  Validation 

The  performance  model  is  the  basis  for  determining  a 
good  work  allocation  and,  more  importantly,  provides  the 
basis  for  selecting  a  final  schedule  among  those  that  have 
been  considered.  We  tested  model  accuracy  for  a  variety  of 
problem  sizes  and  target  resource  sets  (see  Figure  3).  For 
the  62  runs  shown  in  this  figure  the  model  accurately  pre¬ 
dicts  execution  time  within  1.5%,  on  average.  The  perfor¬ 
mance  model  consistently  achieved  this  level  of  accuracy 
for  other  runs  taken  under  similar  conditions.  Notice  that  as 
the  problem  size  becomes  larger,  the  smallest  pool  that  we 
test also increases (e.g. the smallest pool for a problem size
of  2000  is  of  size  4  while  for  a  problem  size  of  6000  it  is 
12).  This  experimental  setup  was  required  by  a  limit  in  the 
g77  FORTRAN  compiler  we  employed:  no  more  than  507 
work  units  could  be  allocated  to  any  one  processor  during 
the  computation. 
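The smallest usable pool follows directly from the 507-unit limit: a problem of a given size needs at least the ceiling of size/507 processors. The sketch below reproduces that arithmetic; the function names are illustrative, and the even-sized pool series starting at 4 is taken from the experimental design described earlier.

```python
import math

MAX_UNITS_PER_PROC = 507  # per-processor limit reported in the text


def min_processors(prob_size, max_units=MAX_UNITS_PER_PROC):
    """Fewest processors such that no processor exceeds max_units."""
    return math.ceil(prob_size / max_units)


def smallest_tested_pool(prob_size, step=2, floor=4):
    """Smallest pool in the series 4, 6, 8, ... that is large enough."""
    need = max(min_processors(prob_size), floor)
    return need + (need - floor) % step


pools = {size: smallest_tested_pool(size) for size in (2000, 6000)}
```

For problem sizes 2000 and 6000 this gives pools of 4 and 12, matching the values quoted in the paragraph.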

Figure  3  demonstrates  the  importance  of  selecting  an  ap¬ 
propriate  number  of  target  resources  for  PMHD3D.  For  ex¬ 
ample,  for  a  problem  size  of  1000  the  minimal  execution 
time  is  achieved  when  the  application  is  run  on  10  proces¬ 
sors.  If  fewer  processors  are  used,  the  amount  of  work  per 
processor  is  high  and  the  overall  execution  time  is  higher. 
Figure 3. Model predictions (dashed lines) and observed execution time
(solid lines) for a variety of problem sizes and pool sizes. (Axis
label: Number of processors.)

Table 1. Number of resources to target for various problem sizes under
unloaded conditions. Optimal is the best choice, range indicates close
to optimal choices.

Size     1000    2000    3000    4000    5000    6000
Hosts    10      12      14      16      18      18
Range    8-12    12-14   14-16   14-18   16-18   18-20


If more processors are used, the added communication and system
overheads cannot be offset by the advantage of the additional
computational power. Significantly, the performance model accurately
tracks the knee (i.e. inflection point) in the curve and is thus
capable of predicting the correct number of target resources, at least
under these conditions. We report the optimal number of target
resources for all problem sizes tested in Table 1. As will be obvious
in Section 4.3, the optimal number of processors may vary with resource
performance and dynamic system conditions as well as with problem size.

Figure 4 demonstrates the scheduling advantage of accurately predicting
the correct number of processors to target. In these experiments the
PMHD3D AppLeS was allowed to select any number of processors up to the
maximum pool size. The PMHD3D AppLeS selects the maximum number of
resources for each resource pool up to and including a size of 18. For
resource pools of size 20 and larger the optimal number of hosts is 18
and the PMHD3D AppLeS correctly selects only 18 hosts.


Figure  4.  PMHD3D  AppLeS  predicted  and  ac¬ 
tual  execution  times  for  a  problem  size  of 
5000. 


4.3.  Performance  Results 

Once we verified that the performance model is accurate in a
predictable environment (i.e. where resources are dedicated), we turned
our attention to considering the performance of the AppLeS in a more
dynamic, unpredictable, multi-user environment. We begin by
investigating the ability of PMHD3D AppLeS to compare available
resources and select desirable hosts (computationally fast,
well-connected, or both). To provide a comparison point we test the
performance of another available scheduler, namely the default Legion
scheduler. We conducted experiments in runs, namely back-to-back
PMHD3D executions using the same resource pool and the same problem
size but utilizing the PMHD3D AppLeS scheduler first and the default
Legion scheduler second.


Figure  5.  PMHD3D  performance  attained  with 
and  without  the  AppLeS  scheduler  for  a  prob¬ 
lem  size  of  1000. 


In Figure 5 we show a series of runs comparing the
two  schedulers  for  a  problem  size  of  1000.  Clearly,  the 
PMHD3D  AppLeS  provides  a  performance  advantage  for 
all  resource  set  sizes  tested.  However,  it  is  notable  that  the 
two  execution  time  curves  follow  the  same  trend  only  when 
the  resource  pool  is  in  the  range  of  4-12  hosts.  When  more 
resources  are  added  to  the  pool  the  execution  time  achieved 
with  the  PMHD3D  AppLeS  remains  constant  while  the  de¬ 
fault  Legion  scheduler  execution  time  diverges.  The  default 
Legion  scheduler  allocates  all  available  resources,  a  less 
than  optimal  strategy  for  PMHD3D.  In  Table  2  we  report 
the  typical  number  of  processors  selected  by  AppLeS  for 
different  problem  sizes  and  resource  set  sizes. 

For pool sizes of 4-12, the execution time achieved via the
PMHD3D AppLeS is consistently 20-25 seconds lower
than that achieved via the default scheduler. In this range
of  pool  sizes,  the  PMHD3D  AppLeS  selects  the  maximum 
number  of  hosts  available  and  so  uses  the  same  number  of 
resources  as  the  default  Legion  scheduler.  The  performance 
advantage  is  achieved  by  selecting  “desirable”  resources, 
i.e.  resources  that  are  computationally  fast  and/or  well- 
connected.  Figure  6  illustrates  the  load  of  all  available  ma¬ 
chines  just  before  scheduling  occurred  for  the  18-processor 
run  shown  in  Figure  5.  Clearly,  the  PMHD3D  AppLeS  se¬ 
lects  lightly  loaded  hosts  (i.e.  those  hosts  with  high  avail¬ 
ability)  while  the  default  scheduler  selects  several  loaded 
hosts. It is the load on these selected machines that causes a
performance disadvantage for the default scheduler. In a more
heterogeneous network environment the connectivity of the hosts would
also play an important role in host selection and resulting
performance.

Table 2. Hosts chosen by PMHD3D AppLeS. The Legion default scheduler
always selects the maximum number of hosts. A dash marks a combination
for which no value is reported.

                     Problem Size
Max Hosts    1000    2000    4000    5000    6000
    4           4       4       -       -       -
    6           6       6       -       -       -
    8           8       8       8       -       -
   10          10      10      10      10       -
   12          10      12      12      12      12
   14          10      12      14      14      14
   16          10      12      14      16      16
   18          10      12      16      16      18
   20          10      12      14      18      20
   22          10      12      14      18      20
   24          10      14      14      18      18
   26          10      14      14      18      18

We  obtained  83  runs  comparing  the  default  Legion 
scheduler  to  the  PMHD3D  AppLeS  for  a  variety  of  problem 
sizes  (1000-6000)  and  pool  sizes  (4-26).  Figure  7  shows  a 
histogram  of  the  percent  improvement  the  PMHD3D  Ap¬ 
pLeS  achieved  over  the  default  Legion  scheduler  for  the  83 
runs  (the  average  improvement  was  30%). 
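The text does not define percent improvement explicitly. A common reading, assumed here, is the reduction in execution time relative to the default scheduler's time. The function name and the timings in the usage lines are hypothetical and are not measurements from the experiments.

```python
def percent_improvement(t_default, t_apples):
    """Relative reduction in execution time, in percent (assumed definition)."""
    return 100.0 * (t_default - t_apples) / t_default


# Made-up timings in seconds for three back-to-back run pairs.
pairs = [(100.0, 70.0), (80.0, 60.0), (50.0, 50.0)]
improvements = [percent_improvement(d, a) for d, a in pairs]
average = sum(improvements) / len(improvements)
```

Under this definition a value of zero means both schedulers performed equally, and a negative value means the AppLeS schedule was slower.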

Note  that  in  a  few  runs  there  was  little  or  no  advantage 
to  using  the  PMHD3D  AppLeS.  In  these  cases  the  proces¬ 
sors  were  essentially  idle  and  the  pool  size  was  below  the 
optimal  number  so  that  the  schedulers  selected  the  same 
number  of  processors.  In  one  run  the  PMHD3D  AppLeS- 
determined  schedule  was  considerably  slower  than  that  de¬ 
termined  by  the  default  Legion  scheduler.  In  this  case  the 
scheduler  created  a  schedule  based  on  incorrect  system  in¬ 
formation:  NWS  forecasts  of  CPU  availability  were  unable 
to predict a sudden change in load on several machines
and  the  resulting  schedule  was  poorly  load  balanced. 

The Legion default scheduler was designed to provide general scheduling
services, not the specialized services we include in the PMHD3D AppLeS.
It is therefore not surprising that the AppLeS is better able to
promote application performance. In fact, the PMHD3D AppLeS could be
developed as a Legion object for scheduling regular, iterative,
data-parallel computations, and this is a focus of future work. Using
the PMHD3D AppLeS and the Legion default scheduling strategy as
extremes, we wanted to explore a third alternative for scheduling: what
a “smart user” might do. In a typical user scenario for a cluster of
machines, a user will have access to a large number of machines and
will typically do a back-of-the-envelope static calculation to
determine an appropriate number of target resources given the
granularity of the application. Although a user may correctly determine
the number of hosts to target, accurate information on resource load
and availability will be difficult or impossible to obtain and
interpret prior to or at compile-time.

Figure 6. A snapshot of CPU availability taken during scheduling for
the 18-processor run shown in Figure 5.

Figure 7. Range of performance improvement obtained by PMHD3D AppLeS.
(Axis label: Percent Improvement.)

Figure 8. Performance obtained by three schedulers when each was given
access to at most 26 processors.

To  simulate  this  user  scenario,  we  developed  a  third 
scheduling  method  called  the  smart  user.  The  smart  user 
selects  an  appropriate  number  of  hosts  but  does  not  select 
hosts  based  on  desirability.  Experiments  were  performed 
for  problem  sizes  ranging  from  1000  to  6000  with  a  pool 
size  of  26  hosts.  Figure  8  shows  the  performance  obtained 
by  the  PMHD3D  AppLeS,  the  default  Legion  scheduler, 
and  that  obtained  by  the  smart  user.  In  these  experiments, 
the  PMHD3D  AppLeS  provides  a  significant  performance 
advantage  over  both  alternatives. 
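The difference between the three strategies can be illustrated with a deliberately simplified host picker. This is a sketch under stated assumptions, not the schedulers' actual logic: the real AppLeS also chooses the number of hosts with its performance model and weighs connectivity, whereas here the target count is passed in and hosts are ranked by CPU availability alone. All names and values are hypothetical.

```python
def pick_hosts(pool, availability, strategy, target):
    """Choose hosts from `pool` under one of three simplified strategies.

    pool: list of host names in the order the system lists them
    availability: dict host -> fraction of CPU available (0..1)
    target: number of hosts judged appropriate for the problem size
    """
    if strategy == "default":
        return list(pool)            # use every host in the pool
    if strategy == "smart_user":
        return list(pool)[:target]   # right count, no regard for load
    if strategy == "ranked":
        ranked = sorted(pool, key=lambda h: availability[h], reverse=True)
        return ranked[:target]       # right count, least loaded first
    raise ValueError(f"unknown strategy: {strategy!r}")


# Made-up availability snapshot for a four-host pool.
pool = ["a", "b", "c", "d"]
avail = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.95}
chosen = {s: pick_hosts(pool, avail, s, target=2)
          for s in ("default", "smart_user", "ranked")}
```

In this toy snapshot the smart user ends up with the heavily loaded host "a", while the ranked strategy avoids it, which mirrors the qualitative gap between the smart user and the AppLeS described here.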

5.  Related  Work 

The PMHD3D AppLeS is an adaptation and extension of previous work
targeting the structurally similar Jacobi-2D application [2, 3].
Jacobi-2D is a data-parallel, stencil-based iterative code, as is
PMHD3D. Both applications allow non-uniform work distribution; however,
Jacobi-2D employs strip decomposition (using strip widths) for its
2-dimensional grid while PMHD3D employs slab decomposition (using slab
height) for its 3-dimensional grid. While the applications are
structurally similar, PMHD3D required tighter constraints on memory
availability and a more complex performance model. Additionally,
PMHD3D was targeted for a much larger resource set (34 machines vs. 8).
The availability of a larger resource pool for this work motivated the
introduction of the quadratic overhead term in the PMHD3D performance
model. Previous AppLeS work has not included the additional overhead of
using extra machines in scheduling decisions.

As part of our previous work, we developed an AppLeS for Complib and
the Mentat distributed programming environment. Complib implements a
genetic sequencing algorithm for libraries of sequences. It is
particularly difficult to schedule because of its highly data-dependent
execution profile. The implementation of Complib we chose was for
Mentat [8], which is an early prototype of the Legion Grid software
infrastructure. By combining a fixed initial distribution strategy
(based on a combination of application characteristics and NWS
forecasts) with a shared work-queue distribution strategy, the Complib
AppLeS was able to achieve large performance improvements in different
Grid settings [20]. In addition to AppLeS for Legion/Mentat
applications, we have developed AppLeS for a variety of Grid
infrastructures and applications [19, 21, 7].

In  [10],  the  authors  describe  a  scheduler  targeting  data 
parallel  “stencil”  applications  that  use  the  Mentat  program¬ 
ming  system.  They  specifically  examine  Gaussian  elimi¬ 
nation  using  a  master/slave  work-distribution  methodology. 
While  it  is  difficult  to  compare  the  performance  of  each  sys¬ 
tem,  their  approach  differs  from  AppLeS  in  that  it  requires 
more  extensive  modification  of  the  application  and  it  does 
not  incorporate  dynamic  information. 


6.  New  Directions 

An  ultimate  goal  is  to  offer  the  PMHD3D  AppLeS  agent 
within  the  Legion  framework  as  a  default  scheduler  for  it¬ 
erative,  regular,  stencil-based  distributed  applications.  In 
particular,  the  scheduler’s  performance  model  is  flexible 
enough  to  incorporate  the  requirements  and  constraints  of 
other  stencil  applications  and  the  characteristics  of  other 
platforms.  To  use  this  model  for  other  appropriate  appli¬ 
cations,  good  predictions  of  megabytes  transferred,  number 
of  messages  initiated,  overhead  factor,  benchmarks  for  pro¬ 
gram  CPU  and  memory  utilization  over  the  different  target 
architectures,  as  well  as  access  to  dynamic  system  infor¬ 
mation  from  NWS  or  a  similar  system  would  be  required. 
Once  obtained,  these  characteristics  are  used  as  inputs  to 
the  model  without  changing  the  model  structure. 

Portability  and  heterogeneity  are  also  important.  The 
AppLeS  itself  is  written  in  C  and  Perl  and  has  been  com¬ 
piled  successfully  and  executed  on  various  architectures  and 
systems  (Pentium,  Alpha,  Linux  and  Solaris).  Initial  results 
indicate  that  the  scheduler  can  be  used  effectively  on  dif¬ 
ferent  target  environments  without  changes  to  the  structure 
of  the  performance  model.  For  example,  we  used  mpich  on 
a  local  cluster  for  initial  development  and  debugging.  The 
scheduler worked well with only the previously described
changes  in  model  input  parameters. 